I chose the “Medical Insurance” dataset. This dataset primarily records the different characteristics among members of health insurance companies and the premiums they currently pay. We know that in order to make a profit, insurance companies should collect higher premium than the amount paid to the insured person, so how to charge different fees for different types of customers is a very critical issue for insurance companies. I want to use the “insurance” dataset to explore which behaviors and performances of insurance company members have a significant impact on their premiums, so as to help insurance companies formulate the best pricing strategy.
The main methods used in this study:
After analyzing the relationship between each variable and between each variable and predictor, I decided to use linear regression to predict charges. Then, there were two models created with different numbers of variables. Moreover I evaluated the two models’ peformance and chose a better one. Finally, I created a prediction function based on the regression results and applied it to possible cases.
library(tidyverse)
library(MLmetrics)
library(Hmisc)
library(ggplot2)
library(ggthemes)
library(PerformanceAnalytics)
library(readr)
insurance <- read_csv("~/Desktop/insurance.csv")
head(insurance,5)
## # A tibble: 5 × 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
str(insurance)
## spec_tbl_df [1,338 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:1338] 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr [1:1338] "female" "male" "male" "male" ...
## $ bmi : num [1:1338] 27.9 33.8 33 22.7 28.9 ...
## $ children: num [1:1338] 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr [1:1338] "yes" "no" "no" "no" ...
## $ region : chr [1:1338] "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num [1:1338] 16885 1726 4449 21984 3867 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_character(),
## .. bmi = col_double(),
## .. children = col_double(),
## .. smoker = col_character(),
## .. region = col_character(),
## .. charges = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
describe(insurance)
## insurance
##
## 7 Variables 1338 Observations
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 1338 0 47 0.999 39.21 16.21 18 19
## .25 .50 .75 .90 .95
## 27 39 51 59 62
##
## lowest : 18 19 20 21 22, highest: 60 61 62 63 64
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 1338 0 2
##
## Value female male
## Frequency 662 676
## Proportion 0.495 0.505
## --------------------------------------------------------------------------------
## bmi
## n missing distinct Info Mean Gmd .05 .10
## 1338 0 548 1 30.66 6.893 21.26 22.99
## .25 .50 .75 .90 .95
## 26.30 30.40 34.69 38.62 41.11
##
## lowest : 15.960 16.815 17.195 17.290 17.385, highest: 48.070 49.060 50.380 52.580 53.130
## --------------------------------------------------------------------------------
## children
## n missing distinct Info Mean Gmd
## 1338 0 6 0.899 1.095 1.275
##
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##
## Value 0 1 2 3 4 5
## Frequency 574 324 240 157 25 18
## Proportion 0.429 0.242 0.179 0.117 0.019 0.013
## --------------------------------------------------------------------------------
## smoker
## n missing distinct
## 1338 0 2
##
## Value no yes
## Frequency 1064 274
## Proportion 0.795 0.205
## --------------------------------------------------------------------------------
## region
## n missing distinct
## 1338 0 4
##
## Value northeast northwest southeast southwest
## Frequency 324 325 364 325
## Proportion 0.242 0.243 0.272 0.243
## --------------------------------------------------------------------------------
## charges
## n missing distinct Info Mean Gmd .05 .10
## 1338 0 1337 1 13270 12301 1758 2347
## .25 .50 .75 .90 .95
## 4740 9382 16640 34832 41182
##
## lowest : 1121.874 1131.507 1135.941 1136.399 1137.011
## highest: 55135.402 58571.074 60021.399 62592.873 63770.428
## --------------------------------------------------------------------------------
# identify duplicate value
insurance[duplicated(insurance), ]
## # A tibble: 1 × 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 19 male 30.6 0 no northwest 1640.
insurance <- insurance %>%
distinct()
From the above information and combine with the backgroud of this dataset, we can get:
There are 7 variables, 1338 observations in this dataset, and there is no any missing values, only one duplicate value.
Age: insurance contractor age; from 18 to 64 years old; the average age is 39; continous variable.
Sex: insurance contractor gender; including female and male; the proportions of them are almost evenly distributed; categorical variable.
BMI: body mass index, a value calculated from a person’s mass (weight) and height; from 15.96 to 53.13; the average bmi is 30.66; continous variable.
Children: numbers of children insurance contractor has; from 0 to 5; most contractors have no more than 2 children; continous variable.
Smoker: show insurance contractor smoking or not; including yes and no; non-smokers are four times as likely as smokers; categorical variable.
Region: show insurance contractor’s living area in the US; including four areas-northeast, southeast, southwest, northwest; the proportions of them are almost evenly distributed; categorical variable.
Charges: Expenses to be paid by insurance contractor as set by insurance company; from 1121 to 63770; the average charges is 13270; continous variable and predict value.
age_charges <- ggplot(insurance, aes(age, charges)) +
geom_jitter(color = 'green', alpha = 0.5) + theme_clean() + ggtitle("Plot of Charges by Age")
age_charges
We can see that as the age increases, the charges also increase accordingly. At the same time, there are three clear linear positive correlation lines in this plot, but mainly concentrated on the bottom line.
BMI_charges <- ggplot(insurance, aes(bmi, charges)) +
geom_jitter(color = "orange", alpha = 0.5) + theme_clean() + ggtitle("Plot of Charges by BMI")
BMI_charges
Overall, charges increased with BMI. However, the trend was not as clear as age, so I decide to look more details of BMI.
After I did some research, I found that BMI could be classified into the following categories
insurance<- insurance %>%
mutate(bmi_cat = case_when(
bmi < 18.5 ~ "underweight",
bmi < 25 ~ "healthy weight",
bmi < 30 ~ "overweight",
bmi >= 30 ~ "obesity"
))
insurance$bmi_cat = factor(insurance$bmi_cat,
levels = c("underweight",
"healthy weight",
"overweight",
"obesity"))
insurance %>%
ggplot(aes(x = bmi, y = charges, color = bmi_cat)) +
geom_point() +
labs(title = "Plot Charges By BMI_cat",
x = "Bmi",
y = "Charges ($)") +
facet_wrap(~bmi_cat)
insurance %>%
ggplot(aes(x = age, y = charges, color = bmi_cat)) +
geom_point() +
labs(title = "Plot Charges By Age with BMI_cat",
x = "Age",
y = "Charges ($)") +
facet_wrap(~bmi_cat)
From the above plots, we can see that people with a bmi over 30 (obesity), especially order people, have higher charges than those with a bmi below 30.
children_charges <- ggplot(insurance, aes(children, charges)) +
geom_jitter(aes(color = children), alpha = 0.5) + theme_clean() + ggtitle("Plot of Charges by Children")
children_charges
From this plot, we get that in families with the number of children from 0 to 5, the charges for clients with five children is lower.
gender_charges <- ggplot(insurance,aes(sex,charges)) + geom_boxplot(fill = c("red", "blue")) +
theme_clean() + ggtitle("Plot of Charges by Gender")
gender_charges
Charges barely differ by gender.
smoker_charges <- ggplot(insurance,aes(smoker,charges)) + geom_boxplot(fill = c("green", "red")) +
theme_clean() + ggtitle("Plot of Charges by Smoke Status")
smoker_charges
The result, as I expected, showed that it was more likely have huge charges for smokers than non-smokers.
region_charges <- ggplot(insurance,aes(region,charges)) + geom_boxplot(fill = c(5:8)) +
theme_clean() + ggtitle("Plot of Charges per Region")
region_charges
There is little difference in charges in the four regions above.
Overall,by analyzing each of the above variables one by one, I initially found the much effect of age, BMI, smokers and number of children on charges. And smoker seems to be a very essential factor here.
So now, I would like to check how charges varies by Age, BMI, and number of Children according to smoker.
p1 <- ggplot(data = insurance, aes_string(x = 'age', y = 'charges', group = 'smoker', fill = 'smoker', col = 'smoker')) +
geom_jitter() +
geom_smooth(formula = y ~ x, method = "lm") +
ggtitle(glue::glue("Charges vs Age "))
p1
p2 <- ggplot(data = insurance, aes_string(x = 'bmi', y = 'charges', group = 'smoker', fill = 'smoker', col = 'smoker')) +
geom_jitter() +
geom_smooth(formula = y ~ x, method = "lm") +
ggtitle(glue::glue("Charges vs BMI "))
p2
p3 <- ggplot(data = insurance, aes_string(x = 'children', y = 'charges', group = 'smoker', fill = 'smoker', col = 'smoker')) +
geom_jitter() +
geom_smooth(formula = y ~ x, method = "lm") +
ggtitle(glue::glue("Charges vs Children "))
p3
we can see that even under the influence of different variables, the charges of smokers is always higher than that of non-smokers
Now, I am going to run the chart.Correlation to get an overview of correlation among variables and between each variable and charges.
insurance$sex[insurance$sex == "male"] <- 1
insurance$sex[insurance$sex == "female"] <- 0
insurance$smoker[insurance$smoker == "yes"] <- 1
insurance$smoker[insurance$smoker == "no"] <- 0
insurance<- insurance %>%
mutate(smoker = as.numeric(smoker)) %>%
mutate(sex = as.numeric(sex))
df <- insurance[, c(1,2,3,4,5,7)]
chart.Correlation(df, histogram=TRUE, pch=19)
From this plot, we can see that smoker has the highest correlation with charges. At the same time, we can see that the relationship between Age and Bmi and the relationship between Sex and Smoker are related. It makes sense. Generally speaking, older people are more likely to have a high BMI, and men are more likely to be smokers than women. But in overall, except for the predictor of charges, the correlation coefficient between the variables is very small.
set.seed(1)
row.number <- sample(1:nrow(insurance), round(0.8*nrow(insurance)))
train = insurance[row.number,]
test = insurance[-row.number,]
dim(train)
## [1] 1070 8
dim(test)
## [1] 267 8
mod1 <- lm(charges ~ age + bmi_cat + children + smoker + region + sex, train)
prediction1 <- predict(mod1,test)
summary(mod1)
##
## Call:
## lm(formula = charges ~ age + bmi_cat + children + smoker + region +
## sex, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12081.9 -3705.3 -299.5 1762.8 27502.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6122.07 1588.26 -3.855 0.000123 ***
## age 266.56 13.46 19.799 < 2e-16 ***
## bmi_cathealthy weight 1255.89 1558.63 0.806 0.420558
## bmi_catoverweight 2366.89 1535.70 1.541 0.123556
## bmi_catobesity 6225.18 1524.39 4.084 4.77e-05 ***
## children 473.10 157.38 3.006 0.002709 **
## smoker 23481.90 461.80 50.849 < 2e-16 ***
## regionnorthwest -701.90 540.75 -1.298 0.194560
## regionsoutheast -451.47 530.27 -0.851 0.394745
## regionsouthwest -1000.80 535.95 -1.867 0.062129 .
## sex -38.37 376.45 -0.102 0.918825
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6126 on 1059 degrees of freedom
## Multiple R-squared: 0.7494, Adjusted R-squared: 0.747
## F-statistic: 316.6 on 10 and 1059 DF, p-value: < 2.2e-16
From the first model we built, firstly, we can see that F(324.4) is far greater than 1. It can be concluded that there is a relationship between variables and predictor(charges) .
And as with our previous observations, Age, Bmi(obesity), Children and Smokers are the significant variables (have less p value) for charges.
Then, based on the Adjusted r-squared, we got 0.7516, which is close to 1. It indicates that our first model explains the 75.16% of the variation of charges and hence a good fit.
mod2<- lm(charges ~ age + bmi_cat + children + smoker, train)
prediction2 <- predict(mod2, test)
summary(mod2)
##
## Call:
## lm(formula = charges ~ age + bmi_cat + children + smoker, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12455.9 -3741.9 -290.8 1654.4 27551.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6570.69 1559.02 -4.215 2.71e-05 ***
## age 267.01 13.44 19.861 < 2e-16 ***
## bmi_cathealthy weight 1201.60 1554.71 0.773 0.4398
## bmi_catoverweight 2223.92 1529.48 1.454 0.1462
## bmi_catobesity 6097.65 1513.28 4.029 5.99e-05 ***
## children 462.88 157.19 2.945 0.0033 **
## smoker 23499.25 459.46 51.145 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6126 on 1063 degrees of freedom
## Multiple R-squared: 0.7485, Adjusted R-squared: 0.7471
## F-statistic: 527.2 on 6 and 1063 DF, p-value: < 2.2e-16
mae1 <- MAE(prediction1, test$charges)
rmse1 <- RMSE(prediction1, test$charges)
mod1_reg <- cbind("MAE" = mae1, "RMSE" = rmse1)
mod1_reg
## MAE RMSE
## [1,] 3976.93 5621.011
mae2 <- MAE(prediction2, test$charges)
rmse2 <- RMSE(prediction2, test$charges)
mod2_reg <- cbind("MAE" = mae2, "RMSE" = rmse2)
mod2_reg
## MAE RMSE
## [1,] 3966.14 5613.493
From this new model, we can see that F(541.5) is far greater than 1 and this value is more than the F value(324.4) of the previous model. It makes sense, because we removed the less significant feature.
The relationship between the number of children and charges is relatively insignificant in this new model compared to other variables, but Children still has a low P value.
For the Adjusted r-squared of model 2, we can see it a little bit higher but also quite similar to the first model’s.
Then compare the RMSE of the two models, we got that RMSE2(6479.594) is higher than RMSE1(6459.722). It means model2 has a relatively larger gap between the predicted and observed values.
Overall, I would say there is no big differences between the above two models, and I am going to take model 2 as our final regression model since it only has four observed values, which is more easier to execute following predict work.
prediction<- predict(mod2, test)
ggplot(test, aes(x = prediction, y = charges)) + geom_point(color = "black", alpha = 0.5) + geom_abline(color = "red") +
ggtitle("Prediction vs. Actual values")
Because each data point is quite close to the projected regression line, especially within 20000 charges, we may conclude that our model 2 fits the data reasonably well.
Based on the model2, the equation will be like: charges = -6305.54 +265.39 (age) + 5591.39(bmi_obesity) + 461.11(children) + 23422.51(smoker).
predict_charge <- function(age, bmi, children, smoker){
if(smoker == "yes" && bmi < 30){
result = -6305.54 + (265.39*age) + (461.11*children) + 23422.51
} else if(smoker == "yes" && bmi >= 30) {
result = -6305.54 + (265.39*age) + (461.11*children) + (5591.39*bmi) + 23422.51
} else if(smoker == "no" && bmi < 30){
result = -6305.54 + (265.39*age) + (461.11*children)
} else if(smoker == "no" && bmi >= 30){
result = -6305.54 + (265.39*age) + (461.11*children) + (5591.39*bmi)
}
return(result)
}
Tom:35 years old; bmi is 34.4; 0 child; smoke.
We can get his charge supposed to be:
predict_charge(35,34.4,0,"yes")
## [1] 218749.4
After analysis I found that smoking is the most important factor in increasing premiums.
At the same time, with the increase of age and BMI, the premium should also increase accordingly, especially when the insurer’s BMI exceeds 30 (obesity).
Therefore, we are not able to change our age for sure, but we can try to keep fit and avoid smoking. That’s good for your wallet and your health;)